We were challenged by our organization to make sense of a set of publicly available data surrounding the spread of the Covid-19 novel coronavirus. Presented with the data, we were required to ask meaningful questions revolving around certain metrics and factors that potentially played a part in the spread of Covid-19, as well as draw insightful conclusions based on any analytical techniques or visualization methodologies we would use on this set of data.
We started off the analysis by visually going over the data. What possible questions and what fundamental conclusions could we draw from the data headings provided? Through preliminary discussion we generated the following questions as a baseline for guidance and direction of our analysis.
We used the Jupyter Notebook platform to process the provided data with Python libraries such as numpy, pandas, and plotly. The thought process and flow of analysis is documented below.
We'll start off by importing the Python libraries we will need to perform this analysis.
# import required Python libraries
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
Load the csv file into a Pandas dataframe.
# Read csv dataset into a Pandas dataframe
covid19_data = pd.read_csv(r'C:\Users\tayj2\Desktop\Untitled Folder\covid-19-data.csv')
# Create a tuple representing the dimensionality of our data
rows, cols = covid19_data.shape
# Print the number of columns and rows
print('Columns: {} Rows: {}'.format(cols, rows, ))
As we can see, our dataset includes 9 columns and 12,381 rows of data.
Let's have a look at what columns of data are included.
# Display all column headers
covid19_data.columns
Some of these column names seem self explanatory. We would expect to see a date in the 'Date' column and the last 6 columns look to be numbers representing quantative data points.
However, there are two columns, Entity and Code, that we can not immeadiatly identfy what the values in those columns represent.
Let's first take a look at what types values are in the Entity column.
# Display all unique values in the Entity column
covid19_data.Entity.unique()
So, the Entity column contatins a lot of unique values that appear to mostly represents country data. However, there are values that do not represent a country. There are four categories of values that represent data other than country specific:
Now let's see what values the Code column has.
# Display all unique values in the Code column
covid19_data.Code.unique()
nan_code = covid19_data[covid19_data['Code'].isna()]
nan_code.Entity.unique()
# Display first 5 rows
covid19_data.head()
# Display last 5 rows
covid19_data.tail()
After reviewing the head and tail of the dataset there are a couple interseting observations:
We will take care of these during the data cleaning process.
After a cursory review of the data in Excel, there is one more issue with the data set we need to address. Since the columns should represent a cumulative value from day to day, we should expect to see a value equal to or greater than the value from the previous day.
However, if we take a look at Austrailia for example, there are missing values. For example, on 3-Apr-20 there were 277,278 tests recorded, but on 4-Apr-20 there is no value recorded. We should expect to see at least 277,278 if no new tests had been recorded.
We will fill out the missing values in the dataset during the data cleaning process.
aus_data = covid19_data[(covid19_data['Code'] == 'AUS') &
((covid19_data['Date'] == '12-Apr-20') |
(covid19_data['Date'] == '13-Apr-20'))]
display(aus_data)
While exploring our dataset we found three areas were we can help improve our anaylsis by cleaning up the data:
covid19_data.head()
# Create a list containing column names we want to replace NAN
fix_nan = ['Cumulative tests',
'Cumulative tests per million',
'Total confirmed cases (cases)',
'Confirmed cases per million (cases per million)',
'Total confirmed deaths (deaths)',
'Confirmed deaths per million (deaths per million)']
# loop through the list of column names and replace NAN with 0.0
for column in fix_nan:
covid19_data[column] = covid19_data[column].fillna(0.0)
covid19_data.head()
# Create new empty dataframe with the same columns from covid19_data
df_columns = covid19_data.columns
covid19_data_complete = pd.DataFrame(columns=df_columns)
# Create a list of all unique entries from covid19_data
entities = covid19_data.Entity.unique()
#create list of columns we want to fill out the data series
columns = ['Cumulative tests',
'Cumulative tests per million',
'Total confirmed cases (cases)',
'Confirmed cases per million (cases per million)',
'Total confirmed deaths (deaths)',
'Confirmed deaths per million (deaths per million)']
# Loop through each unique entity, create a temporary data frame and loop through each column and row.
# If the value in the column is 0.0 update the value to the value in the previous row.
# Append the temporary dataframe to the new covid_data_complete dataframe.
for entity in entities:
temp_df = covid19_data[covid19_data['Entity'] == entity].reset_index(drop=True)
for i in range (1, len(temp_df)):
for column in columns:
if temp_df.loc[i, column] == 0.0:
temp_df.loc[i, column] = temp_df.loc[i-1, column]
covid19_data_complete = covid19_data_complete.append(temp_df, ignore_index=True)
aus_data = covid19_data_complete[(covid19_data_complete['Code'] == 'AUS') &
((covid19_data_complete['Date'] == '12-Apr-20') |
(covid19_data_complete['Date'] == '13-Apr-20'))]
display(aus_data)
# Remove rows that have no test, cases, and death data and create a new dataframe
covid19_data_filtered = covid19_data_complete.drop(
covid19_data_complete[(covid19_data_complete['Cumulative tests'] == 0.0) &
(covid19_data_complete['Cumulative tests per million'] == 0.0) &
(covid19_data_complete['Cumulative tests per million'] == 0.0) &
(covid19_data_complete['Total confirmed cases (cases)'] == 0.0) &
(covid19_data_complete['Confirmed cases per million (cases per million)'] == 0.0) &
(covid19_data_complete['Total confirmed deaths (deaths)'] == 0.0) &
(covid19_data_complete['Confirmed deaths per million (deaths per million)'] == 0.0)
].index)
covid19_data_filtered.head()
# Convert the Date column to the date format YYYY-MM-DD
covid19_data_filtered['Date']= pd.to_datetime(covid19_data_filtered['Date'])
covid19_data_filtered['Date'] = covid19_data_filtered['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
Visualizing the spread through a dynamic choropleth will make it easier for viewers to understand the spread of Covid-19 across the world over the time period of 4 months based on the data provided. This section covers the code used to create these visualizations.
# Create dateframe containing only Countries
by_country_data = covid19_data_filtered[covid19_data_filtered['Code'].notna()]
# Remove 'World' from the Coutnries dataframe
by_country_data = by_country_data[by_country_data.Entity != 'World']
by_country_data = by_country_data.sort_values('Date')
display(by_country_data)
most_recent = by_country_data[by_country_data['Date'] == '2020-04-13'].reset_index(drop=True)
display(most_recent)
# Drop rows prior to 2020-01-20. Data before this date is to small to see on the scatter map.
by_country_scattergeo = by_country_data.drop(by_country_data[(by_country_data['Date'] < '2020-01-20')].index)
fig = px.scatter_geo(by_country_scattergeo, locations='Code',
size = 'Total confirmed cases (cases)',
animation_frame = 'Date',
size_max = 100,
opacity = .5,
title = 'Total confirmed cases by country',
)
fig.add_trace(go.Choropleth(
locations = most_recent['Code'],
z = most_recent['Total confirmed cases (cases)'],
colorscale = 'Reds',
))
fig.update_layout(transition = {'duration': 100})
fig.show()
fig = px.scatter_geo(by_country_scattergeo, locations='Code',
size = 'Confirmed cases per million (cases per million)',
animation_frame = 'Date',
size_max = 100,
opacity = .5,
title = 'Confirmed cases per million by country',
)
fig.add_trace(go.Choropleth(
locations = most_recent['Code'],
z = most_recent['Confirmed cases per million (cases per million)'],
colorscale = 'Reds',
))
fig.update_layout(transition = {'duration': 100})
fig.show()
The visualizations show us that although some countries have higher total cases confirmed, this does not equate to severity due to the fact that countries have different population sizes from each other. The second visualization for cases per million by country, shows normalized data that accounts for this fact and presents the data in a much more accurate way, showing precisely which countries were hit harder by the spread of Covid-19 with respect to population size.
We used a racing bar chart to attempt to depict visually how Covid-19 spread across the world over the time period of 4 months. The bar charts show the top 10 countries over time with respect to total number of confirmed cases. This section shows the code we used to try to answer this question.
day = by_country_data['Date'].min()
by_country_data.sort_values(by=['Date', 'Total confirmed cases (cases)'], ascending=[True, False], inplace=True)
days = by_country_data['Date'].unique()
countries = by_country_data['Entity'].unique()
bar_colors = {'China':'rgb(59, 23, 219)',
'United States':'rgb(21, 188, 145)',
'Thailand':'rgb(194, 26, 221)',
'Japan':'rgb(116, 237, 137)',
'South Korea':'rgb(168, 163, 174)',
'Taiwan':'rgb(176, 172, 46)',
'Singapore':'rgb(119, 68, 60)',
'Vietnam':'rgb(245, 102, 89)',
'Malaysia':'rgb(218, 102, 210)',
'France':'rgb(139, 169, 131)',
'Australia':'rgb(236, 21, 239)',
'Germany':'rgb(180, 197, 157)',
'Iran':'rgb(222, 128, 94)',
'Italy':'rgb(73, 90, 242)',
'Bahrain':'rgb(16, 69, 83)',
'Kuwait':'rgb(86, 225, 80)',
'Spain':'rgb(213, 108, 52)',
'Switzerland':'rgb(44, 140, 111)',
'Denmark':'rgb(107, 163, 213)',
'United Kingdom':'rgb(73, 21, 203)',
'Netherlands':'rgb(183, 43, 223)',
'Belgium':'rgb(122, 249, 136)',
'Turkey':'rgb(28, 225, 69)',
}
fig_dict = {
'data': [],
'layout': {},
'frames': []
}
# Create the layout
fig_dict['layout']['title'] ='Total confirmed cases by country - Top 10 (log)'
fig_dict['layout']['xaxis'] = {'title': 'Country'}
fig_dict['layout']['yaxis'] = {'title': 'Total confirmed cases (log)', 'type': 'log'}
fig_dict['layout']['showlegend'] = False
fig_dict['layout']['hovermode'] = 'closest'
fig_dict['layout']['updatemenus'] = [
{
'buttons': [
{
'args': [None, {'frame': {'duration': 500, 'redraw': True},
'fromcurrent': True, 'transition': {'duration': 300,
'easing': 'quadratic-in-out'}}],
'label': 'Play',
'method': 'animate'
},
{
'args': [[None], {'frame': {'duration': 0, 'redraw': False},
'mode': 'immediate',
'transition': {'duration': 0}}],
'label': 'Pause',
'method': 'animate'
}
],
'direction': 'left',
'pad': {'r': 10, 't': 87},
'showactive': False,
'type': 'buttons',
'x': 0.1,
'xanchor': 'right',
'y': 0,
'yanchor': 'top'
}
]
sliders_dict = {
'active': 0,
'yanchor': 'top',
'xanchor': 'left',
'currentvalue': {
'font': {'size': 20},
'prefix': 'Date:',
'visible': True,
'xanchor': 'right'
},
'transition': {'duration': 300, 'easing': 'linear'},
'pad': {'b': 10, 't': 50},
'len': 0.9,
'x': 0.1,
'y': 0,
'steps': []
}
# Create inital data
for country in countries:
data_by_day = by_country_data[by_country_data['Date'] == day]
data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
data_by_day_and_country = data_by_day[data_by_day['Entity'] == country]
data_dict = {
'type': 'bar',
'x': list(data_by_day_and_country['Entity']),
'y': list(data_by_day_and_country['Total confirmed cases (cases)']),
'name': country,
'marker_color': bar_colors['China'],
}
fig_dict['data'].append(data_dict)
# Make frames for animation
for day in days:
data_by_day = by_country_data[by_country_data['Date'] == day]
data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
data_by_day.sort_values(by='Total confirmed cases (cases)', ascending= False, inplace=True)
countries = data_by_day['Entity'].unique()
frame = {'data': [], 'name': day}
for country in countries:
#data_by_day = by_country_data[by_country_data['Date'] == day]
#data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
data_by_day_and_country = data_by_day[data_by_day['Entity'] == country]
data_dict = {
'type': 'bar',
'x': list(data_by_day_and_country['Entity']),
'y': list(data_by_day_and_country['Total confirmed cases (cases)']),
'name': country,
'marker_color': bar_colors[country]
}
frame['data'].append(data_dict)
fig_dict['frames'].append(frame)
slider_step = {'args': [
[day],
{'frame': {'duration': 1000, 'redraw': False},
'mode': 'immediate',
'transition': {"duration": 300}}
],
'label': day,
'method': 'animate'}
sliders_dict['steps'].append(slider_step)
fig_dict['layout']['sliders'] = [sliders_dict]
fig = go.Figure(fig_dict)
fig.show()
From the visualization it becomes apparent that China was the leader for number of cases across the world, however towards the end of the data set showing the first 4 months of the spread the United States took over as the most affected country by Covid-19 by a wide margin.
This question revolves around trying to figure out if there are any major factors we can retrieve from our data that affect the spread of and survival rate of Covid-19. This section below shows the code we used to try to answer this question.
# pull out income data into it's own subset data frame, then each income category
income_data = covid19_data[(covid19_data['Entity'] == 'High income') | (covid19_data['Entity'] == 'Low income')| (covid19_data['Entity'] == 'Lower middle income') |(covid19_data['Entity'] == 'Upper middle income')]
highincome = income_data[income_data['Entity']=="High income"]
lowincome = income_data[income_data['Entity']=="Low income"]
lmincome = income_data[income_data['Entity']=="Lower middle income"]
umincome = income_data[income_data['Entity']=="Upper middle income"]
# Plot the income data over time using the total confirmed cases as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=highincome["Total confirmed cases (cases)"], name='High Income',
line=dict(color='rgb(0,128,255)', width=4)))
fig.add_trace(go.Scatter(x=umincome.Date, y=umincome["Total confirmed cases (cases)"], name='Upper Middle Income',
line=dict(color='rgb(153,229,255)', width=4)))
fig.add_trace(go.Scatter(x=lmincome.Date, y=lmincome["Total confirmed cases (cases)"], name='Lower Middle Income',
line=dict(color='rgb(255,153,153)', width=4)))
fig.add_trace(go.Scatter(x=lowincome.Date, y=lowincome["Total confirmed cases (cases)"], name='Low Income',
line=dict(color='rgb(255,0,0)', width=4)))
fig.update_layout(
title="Total confirmed cases of Covid-19 of income classes from Jan 2020 to Apr 2020",
xaxis_title="Date",
yaxis_title="Total number of cases of Covid-19",
template="plotly_dark",
font=dict(
family="arial",
size=18,
)
)
# Plot the income data over time using the total deaths as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=highincome["Total confirmed deaths (deaths)"], name='High Income',
line=dict(color='rgb(0,128,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=umincome.Date, y=umincome["Total confirmed deaths (deaths)"], name='Upper Middle Income',
line=dict(color='rgb(153,229,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lmincome.Date, y=lmincome["Total confirmed deaths (deaths)"], name='Lower Middle Income',
line=dict(color='rgb(255,153,153)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lowincome.Date, y=lowincome["Total confirmed deaths (deaths)"], name='Low Income',
line=dict(color='rgb(255,0,0)', width=4, dash="dot")))
fig.update_layout(
title="Total confirmed deaths from Covid-19 of income classes from Jan 2020 to Apr 2020",
xaxis_title="Date",
yaxis_title="Confirmed deaths from Covid-19",
template="plotly_dark",
font=dict(
family="arial",
size=18,
)
)
# Plot the ratio of the total confirmed deaths against the total confirmed cases as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=(highincome["Total confirmed deaths (deaths)"]/highincome["Total confirmed cases (cases)"]), name='High Income',
line=dict(color='rgb(0,128,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=umincome.Date, y=(umincome["Total confirmed deaths (deaths)"]/umincome["Total confirmed cases (cases)"]), name='Upper Middle Income',
line=dict(color='rgb(153,229,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lmincome.Date, y=(lmincome["Total confirmed deaths (deaths)"]/lmincome["Total confirmed cases (cases)"]), name='Lower Middle Income',
line=dict(color='rgb(255,153,153)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lowincome.Date, y=(lowincome["Total confirmed deaths (deaths)"]/lowincome["Total confirmed cases (cases)"]), name='Low Income',
line=dict(color='rgb(255,0,0)', width=4, dash="dot")))
fig.update_layout(
title="Ratio of deaths to cases of Covid-19 from each income class over time",
xaxis_title="Date",
yaxis_title="Deaths per Case from Covid-19",
template="plotly_dark",
font=dict(
family="arial",
size=18,
)
)
The conclusion that can be drawn is that at the end of time period of the data given, there is very little difference in the number of deaths per case between income classes. The biggest outlier was the high income category with a 6% mortality rate compared to the others that hovered around 4%.
# pull out world data into it's own subset data frame
world_data = covid19_data[(covid19_data['Code'] == 'OWID_WRL')]
display(world_data)
w_rows, w_cols = world_data.shape
w_list = list(range(0, w_rows))
w_dat = world_data["Total confirmed cases (cases)"]
# calculate polynomial regression
z = np.polyfit(w_list, w_dat, 6)
f = np.poly1d(z)
#calculate best fit
y_new = f(w_list)
ybar = np.sum(w_dat)/len(w_dat)
ssreg = np.sum((y_new-ybar)**2)
sstot = np.sum((w_dat - ybar)**2)
r_squared = ssreg/sstot
print("The R-Squared value of the polynomial regression line to the 6th Degree is ", r_squared)
#plot the scatter plot of the world data
polyfit = go.Scatter(
x=world_data.Date,
y=y_new,
mode='lines',
marker=go.Marker(color='rgb(31, 119, 90)'),
name='Trendline'
)
worlddat = go.Scatter(x=world_data.Date, y=(world_data["Total confirmed cases (cases)"]),
name='World',
mode="markers",
marker=dict(color='red', size=3),
line=dict(color='rgb(0,128,255)', width=4, )
)
fig = go.Figure()
fig.add_trace(worlddat)
fig.add_trace(polyfit)
fig.update_layout(
title="Confirmed cases worldwide over time",
xaxis_title="Date",
yaxis_title="Confirmed cases of Covid-19",
template="plotly_dark",
font=dict(family="arial",size=18),
annotations=
[dict(
x=50,
y=700000,
text="-5.81e-05 X^6 - 0.01825X^5 - 2.057X^4 + 104.2X^3 - 2328X^2 + 1.956e+04X - 3.461e+04",
showarrow=False,
font=dict(family="arial",size=12, color='rgb(31, 119, 90)' ),
)]
)
The data set provided gives us a good foundation for visualizing the spread of Covid-19. The ability to see how the virus spread across the world throughout the timescale provides some insight to the current globalization of our society and the ease of which people can move about the world. It would appear that for the most part, death rates of the virus are indiscriminate of income levels.